
transport: drop outlier RTT samples above MaxReasonableRTTMs (30s) #2421

Merged
0pcom merged 1 commit into skycoin:develop from 0pcom:fix/transport-latency-rtt-upper-bound
May 3, 2026

Conversation


0pcom commented May 3, 2026

Summary

Production observation post-#2415/#2418: three sudph transports show min/avg ~190ms but max ~33-35 seconds in TPD's /metrics. The min/avg are correct; the max comes from a single straggler pong arriving long after its ping. handleTransportPong correlates a pong to its ping only by timestamp, with no in-flight tracking that would discard a pong arriving after the implied timeout. Once such a sample lands, Max only ever grows, so a single bad sample pins the transport's Max for the life of the lat:<id> key (35 days).

Real intercontinental RTT plus heavy queueing is comfortably under 5s, so 30s is a generous upper bound for "obviously bogus".
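
To make the pinning concrete, here is a minimal Go sketch of a running min/max/avg aggregation. It is illustrative only; the type and field names are not the actual skywire code. Once a single 35s straggler lands, max sticks no matter how many good samples follow.

```go
package main

import "fmt"

// latencyStats is a hypothetical running-stats holder, only to show why one
// bad sample pins Max: the aggregation only ever keeps the larger value.
type latencyStats struct {
	Min, Max, Avg float64 // milliseconds
	count         int
}

func (s *latencyStats) add(sampleMs float64) {
	s.count++
	if s.count == 1 || sampleMs < s.Min {
		s.Min = sampleMs
	}
	if sampleMs > s.Max {
		s.Max = sampleMs // a single 35_000 ms outlier sticks here forever
	}
	s.Avg += (sampleMs - s.Avg) / float64(s.count) // running mean
}

func main() {
	var s latencyStats
	for i := 0; i < 1000; i++ {
		s.add(190) // steady ~190 ms RTTs
	}
	s.add(35_000) // one straggler pong
	for i := 0; i < 1000; i++ {
		s.add(190) // more good samples change nothing about Max
	}
	fmt.Printf("min=%.0f avg=%.0f max=%.0f\n", s.Min, s.Avg, s.Max)
	// Output: min=190 avg=207 max=35000; min/avg stay plausible, max is pinned.
}
```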

Fix

Mirror the existing zero-clobber lower-bound guard with an upper bound at every layer (a rough Go sketch of the guard shape follows this list):

  • transport.MaxReasonableRTTMs — exported constant, 30_000.0 ms.
  • SetLatency drops latencyMs > MaxReasonableRTTMs (alongside the existing <= 0).
  • SetLatencyStats rejects the whole snapshot if any of min/max/avg exceeds the bound.
  • handleTransportPong short-circuits before SetLatency, Trace-logs the dropped value.
  • RouteGroup.MeasureLatency drops outlier samples from measurements so a stale pong from the periodic pingLoop landing in the buffered pongCh can't poison an active measurement's min/max/avg.
  • RouteGroup.handlePongPacket rejects up front so neither networkStats, the synchronous pendingPongCh consumer, nor the underlying transport see the bogus value.
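
For reference, a rough sketch of the guard shape. Only MaxReasonableRTTMs is from this PR; the package and helper names below are made up for illustration, not the actual skywire source.

```go
package guard

// MaxReasonableRTTMs mirrors the exported bound added in this PR: any RTT
// sample above it is treated as obviously bogus and dropped.
const MaxReasonableRTTMs = 30_000.0 // milliseconds

// validRTT extends the existing zero-clobber lower bound with the upper bound.
// Samples exactly at the bound are still accepted.
func validRTT(latencyMs float64) bool {
	return latencyMs > 0 && latencyMs <= MaxReasonableRTTMs
}

// validSnapshot rejects a whole min/max/avg snapshot if any field is out of
// bounds, so a partially bogus snapshot never reaches the stored stats.
func validSnapshot(minMs, maxMs, avgMs float64) bool {
	return validRTT(minMs) && validRTT(maxMs) && validRTT(avgMs)
}
```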

Test plan

  • go build ./... clean.
  • go vet ./pkg/transport/... ./pkg/router/... clean.
  • go test ./pkg/transport/ ./pkg/router/ ./pkg/transport-discovery/... all pass.
  • gofmt clean.
  • New regression tests (a sketch of the boundary cases follows this list):
    • TestSetLatencyDropsOutliers — rejects the 35s production case, accepts exactly at the bound, valid samples after rejection still update the running stats.
    • TestSetLatencyStatsDropsOutliers — table test rejects partial-outlier shapes (over-bound min/max/avg in any combination), accepts a fully-in-bound snapshot.
  • After deploy, in skywire cli tp metrics --by-transport: no transport's latency.max exceeds 30,000 ms; production cases like 51eced2f, f4c0e8c7, f95d6019 (currently min/avg ~190ms, max ~32-35s) settle to a plausible max within ~5 minutes of a CXO push.
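
The regression tests pin roughly these boundary cases. This is a sketch against an illustrative stand-in helper, not the real SetLatency/SetLatencyStats signatures.

```go
package guard

import "testing"

// accept stands in for the real SetLatency guard, whose signature is not
// reproduced here; the bound mirrors transport.MaxReasonableRTTMs.
func accept(rttMs float64) bool { return rttMs > 0 && rttMs <= 30_000.0 }

func TestDropsOutliers(t *testing.T) {
	cases := []struct {
		name  string
		rttMs float64
		want  bool
	}{
		{"35s production straggler", 35_000, false},
		{"exactly at the bound", 30_000, true},
		{"valid sample after a rejection", 190, true},
		{"existing zero-clobber case", 0, false},
	}
	for _, c := range cases {
		if got := accept(c.rttMs); got != c.want {
			t.Errorf("%s: accept(%v) = %v, want %v", c.name, c.rttMs, got, c.want)
		}
	}
}
```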

Followup to

#2415 (zero-clobber) and #2418 (persistent latency). The 35s outliers were the first thing visible once latency started persisting, since transient bad samples used to disappear with the registration TTL — they now stay around for 35 days.

Production observation: three sudph transports persisted with
min/avg ~190ms and max ~33-35 SECONDS in /metrics. The min/avg are
correct; the max is a single straggler pong arriving long after its
ping (delayed packet, host suspended, NIC queue stall). Once it
lands, Max only ever grows, so a single bad sample pins the
transport's Max for the life of the lat:<id> key (35 days).

handleTransportPong correlates pongs to pings only by timestamp —
there's no in-flight tracking that would discard a pong arriving N
seconds after timeout. Real intercontinental RTT plus heavy
queueing is comfortably under 5s, so 30s is a generous upper bound
for "obviously bogus".

Mirror the existing zero-clobber lower-bound guard with an
upper-bound at every layer:

  - transport.MaxReasonableRTTMs (exported, 30_000.0).
  - SetLatency drops latencyMs > MaxReasonableRTTMs (alongside <=0).
  - SetLatencyStats rejects the whole snapshot if any of min/max/avg
    exceeds the bound.
  - handleTransportPong short-circuits before SetLatency, with a
    Trace log of the dropped value.
  - RouteGroup.MeasureLatency drops samples >MaxReasonableRTTMs from
    `measurements` so a stale pong from the periodic pingLoop
    leaking into the buffered pongCh can't poison the active
    measurement's min/max/avg.
  - RouteGroup.handlePongPacket rejects up front so neither
    networkStats, the synchronous MeasureLatency consumer, nor the
    underlying transport see the bogus value.

TestSetLatencyDropsOutliers and TestSetLatencyStatsDropsOutliers
pin both the rejection and the boundary-exact acceptance.
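
The MeasureLatency change described in the commit amounts to filtering samples as they are collected. Roughly, under assumed shapes (the channel type and function below are not the actual RouteGroup code):

```go
package measure

import "time"

const maxReasonableRTT = 30 * time.Second // time.Duration form of the bound

// collect drains up to n pongs and keeps only plausible samples, so a stale
// pong leaking in from the periodic pingLoop cannot poison min/max/avg.
// Carrying the matching ping's send time on the channel is an assumption
// made only for this sketch.
func collect(pongCh <-chan time.Time, n int) []time.Duration {
	samples := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		sentAt, ok := <-pongCh
		if !ok {
			break
		}
		rtt := time.Since(sentAt)
		if rtt <= 0 || rtt > maxReasonableRTT {
			continue // stale or bogus pong: drop instead of recording
		}
		samples = append(samples, rtt)
	}
	return samples
}
```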
@0pcom 0pcom merged commit 0a7e12b into skycoin:develop May 3, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 4, 2026
…iminator (#2422)

Production TPD was restarting every 30-40s (RestartCount 556 over 6h
on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216
   panic: "Finc to negative for: <hash>"
   The filling-refcount went below zero — likely a duplicate Finc on
   a Filler.incs map, or an Inc/Finc mismatch across overlapping
   fillers. Hard process kill via panic. Filler.apply / Filler.reject
   already consume Finc's error return and just log; surface the
   inconsistency through that path instead. Clamp fc to 0, log the
   condition with key and the offending inc, and continue. Worst
   case is a leaked filling-item slot — orders of magnitude better
   than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50
   panic: "short write: i/o deadline reached"
   isIOError checks errors.Is(err, io.ErrShortWrite), but
   net/http's timeoutWriter returns its own error value with the
   same message string when a write deadline expires mid-response.
   errors.Is misses it (different sentinel), the fallback string
   match didn't include "short write", so getAllTransports'
   ~1MB JSON write to a slow client panics on every deadline hit.
   Added "short write", "i/o timeout", and "deadline exceeded"
   to the string-match fallback. New TestIsIOErrorShortWriteVariants
   pins all sentinel and string-match cases.

Neither bug is caused by #2415/#2418/#2421; they were just made
visible because deploys cycling at 30-40s aren't subtle. Together
these stop the panic loop without changing any data semantics.
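
A hedged sketch of the isIOError shape described in item 2: a sentinel check plus a string-match fallback. Only the three added substrings are from the commit; the package name and surrounding code are illustrative.

```go
package httputilsketch

import (
	"errors"
	"io"
	"strings"
)

// isIOError reports whether err looks like a client-side I/O failure that
// should be logged and swallowed rather than allowed to panic the handler.
func isIOError(err error) bool {
	if err == nil {
		return false
	}
	// The sentinel check catches the standard-library value...
	if errors.Is(err, io.ErrShortWrite) {
		return true
	}
	// ...but net/http's timeoutWriter returns its own error carrying the same
	// message text, so errors.Is misses it; fall back to substring matching.
	// (The real fallback list has other entries; only the newly added ones
	// named in the commit are shown here.)
	msg := err.Error()
	for _, s := range []string{"short write", "i/o timeout", "deadline exceeded"} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}
```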
0pcom added a commit that referenced this pull request May 4, 2026
Visors running pre-#2421 binaries push outlier max samples (e.g. 35s
RTT from a straggler pong) via CXO. TPD's UpdateLatency only had a
lower-bound guard (avg<=0 or any field <=0 → drop), so those reach
the lat:<id> store and pin Max for the 35-day retention regardless
of how many later good samples land.

Production right now: 5 transports show max>30s, all written ~10h
ago, ages-out date 35d from each write. Won't go away on its own.

Reject the same way the visor side does (transport.MaxReasonableRTTMs,
30s) so TPD's defense-in-depth matches the visor's. Old visors keep
pushing bogus values; TPD now silently drops them, and the next good
sample from any peer overwrites the stale stored max.

The package already imports pkg/transport (via redis_transport.go)
so MaxReasonableRTTMs comes free with no new dep.
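
The store-side guard could look roughly like this; transport.MaxReasonableRTTMs is real, while the Stats shape and function name are assumptions made for the sketch.

```go
package store

// Stats mirrors the min/max/avg latency snapshot pushed by visors.
type Stats struct {
	Min, Max, Avg float64 // milliseconds
}

// maxReasonableRTTMs would be transport.MaxReasonableRTTMs in TPD itself.
const maxReasonableRTTMs = 30_000.0

// acceptLatency applies the same defense-in-depth as the visor side: drop the
// whole snapshot if any field is non-positive or over the bound, so old
// visors pushing bogus values can no longer pin the stored max.
func acceptLatency(s Stats) bool {
	for _, v := range []float64{s.Min, s.Max, s.Avg} {
		if v <= 0 || v > maxReasonableRTTMs {
			return false
		}
	}
	return true
}
```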
